Indexing methods
The linear hashing method is used to provide key based lookup. A key is a unique value used to identify a record. In
Many institutions have numbering sequences that uniquely identify an object or event. In some instances the number is almost always unique apart from a few exceptions.
While the Two Level method dictates how a search should be executed internally, it is a flexible method that supports a wide number of searching variations within the one framework. For example, @
operator. If a word is preceded with \@, all words that sound like the word entered will be found. Phonetic searching uses the same indexing as word based searching even though there are different results. The difference is in the definition of a term:
- For word based searching a term is a word.
- For phonetic searching a term is a series of numbers that represents the sound of the word (using a complex algorithm that converts a word into its basic sound groups).
Indexing methods introduced with
- Null based indexing provides very fast retrieval when searching for fields that are empty / not empty.
- partial based indexing offers full index support for wildcard queries that specify leading letters.
Both these methods use the Two Level scheme. They simply provide different terms based on the type of indexing required.
The basic searching provided in
Relax. You know you're in safe hands.
consists of the words:
relax
you
know
youre
in
safe
hands
Notice how the punctuation is removed and the case is converted to lower case (searching is case insensitive). The words above are the search terms that can be used to locate the sample string. When word based indexing is used, extra terms are generated for word pairs (two consecutive words within the same sentence). For the above sample the word pairs generated are:
relax you
you know
know youre
youre in
in safe
safe hands
The word pairs provide direct support for phrase based searching, that is a search where the words are enclosed in double quotes (e.g. "in safe hands"). Enabling word based searching automatically provides phrase based searching.
In many instances it may be useful to search for variations on a base word regardless of the tense or form of the word used. Consider a search for the word election:
- A word based search will return all occurrences of the word election, but it will not find elect, elected, elector, electing, etc.
- A stem based search, which is specified by preceding a word with a tilde (
~
), will find all variations of the word.
In stem based indexing the basic term is the root of the word (in this case elect). Entering any variation of the word preceded by \~ will result in a search for the root word (e.g. searching for \~elected will result in the search term elect). The algorithm to convert a word to its root word is a complex one that has been refined over many years. It is not a simple truncation mechanism, otherwise \~elect would match electric. The rules used to determine the root word are effective for English text only.
Stem based indexing is not enabled on many fields in
When searching for names or scientific terms it is often useful to be able to search for records that contain words that sound like the search term. phonetic based indexing provides this functionality. The basic term for phonetic searching is a number sequence. Each word is converted into a number sequence that encodes the basic sound groups that make up the word. The number sequence is then used for the search and words that generate the same number sequence (and hence sound the same) will be matched.
A phonetic search is specified using the @
character before a word, e.g. \@term.
The algorithm used to convert a word into its basic sound group is very complex as letters can take on different sounds depending on the letters that surround them. A number of refinements to the basic algorithm have been made over many years.
Phonetic based indexing is not enabled on many fields in
Occasionally you may want a search to return records where only the exact value entered in a field is matched. String based indexing uses the complete contents of the field as the term for retrieval and only records that exactly match the search terms will be returned. For example, two records have either one or the other of the following values in a suburb field:
Melbourne
North Melbourne
A word based search using Melbourne
as the search term will return both records (as they both contain the word Melbourne). A String based search, using Melbourne will only return the first record as only one record contains Melbourne and nothing else.
String based searching is not used widely in
Some forms of data may be searched by locating all records between start and end points, otherwise known as a range search.
- Less than (<)
- Less than or equal to (<=)
- Greater than (>)
- Greater than or equal to (>=)
To search for all records inserted between 1 January 2006 and 28 February 2006, in the Date Inserted field you would specify:
>="1 January 2006" <="28 February 2006"
Note the use (and position) of the double quotes as the dates contain spaces.
Range based searching is a variation on the standard Two Level scheme used by the other indexing methods. Essentially it divides the possible values for a range based field into a series of discrete intervals. Then, for each field, value terms are generated to indicate into which intervals the value falls. Using these terms it is possible to provide efficient range based searches.
Sometimes you may only be interested in whether a field contains a value or not. For example, to determine whether records have been assigned to a department, you would use the wildcard search !*
in the Department field (SecDepartment_tab).
NULL based searching provides an additional term that indicates whether the field contains a value or not. This term is used to provide high speed retrieval when certain wildcard searches are used:
\!\*
finds records where a field does not contain a value\*
finds records where a field contains a value
Null based searching can also provide support for searches used to determine whether an attachment field is empty or not:
\!\+
for records that do not have any attachments\+
for records that do have one or more attachments
Most wildcard based searches (i.e. searches looking for patterns or part of a word) specify leading letters. For example, you may want to find all people with a surname starting with Al*
. Since the standard
Partial based indexing provides a mechanism for providing very efficient wildcard searching where leading letters are specified (that is where one or more letters or digits appear at the start of the pattern). The method takes each word in a field and generates a term based on the number of leading letters indexed. Part of the partial based indexing configuration is the number of letters to be used to generate partial terms. For example, a field may be configured to provide partial indexing for the first 1,3,5
letters of each word. If we take the sentence:
Relax. You know you're in safe hands.
the following terms are generated for partial based indexing:
r
|
Thus if we searched for rel\* the three letter terms can be searched to locate the matching records. The searching mechanism makes use of any indexing it can; for example, if you search for re\* the single letter terms are used to retrieve a set of potential matches, which are then checked to see if the second letter matches. Partial based indexing provides the same level of response for wildcard based searches where leading letters are supplied as does word based searching.
Partial indexing is enabled on a small number of fields, typically fields containing names.